Add debug instrumentation for test_play_services
#1013
Signed-off-by: Jorge Perez <[email protected]>
Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that? Sure, the issue may be from a separate repository, but this would be useful for providing data to the appropriate people in order to get the problem fixed...
I think we should do both. This test has been failing for quite a long time, and it makes the lives of all ROS 2 developers difficult. So working around it is the right thing for the short-term, while a bug report to |
@jhdcs I'm not sure if this is what we want; I opened the PR to gather more info from CI and to get feedback from the maintainers and the ROS 2 team on how to proceed. This flaky error is affecting 50% of our https://build.ros2.org/view/Rci/job/Rci__nightly-release_ubuntu_jammy_amd64/ builds at the moment and hasn't been addressed in a long time. We should do something.
I agree that something should be done. And thank you for stepping up to the plate. I just don't want this PR to be used as an excuse for not fixing the underlying problem - which doesn't appear to be an issue here.
@Blast545 @jhdcs @clalancette I would suggest a compromise solution. What I've found is that mostly two tests are failing. I suspect there is some problem with timings or QoS, although I'm not sure exactly where. The higher failure rate with I've tried to think about what those two have in common, and I suspect the failure happens in cases where the service call returns a value.
This feels very reasonable to me. Just in case it's not clear, my only concern is if the underlying issue is never addressed. As long as we're making concrete steps to identify and correct the issue, creating some stopgap measures is perfectly fine! From the rest of your post... It almost sounds like we may have a race condition going on... Does anyone else get that feeling?
- Increased `service_call_timeout_` up to 3 seconds.
- Split `toggle_paused` into two separate tests, `is_paused` and `toggle_paused`.
- Add output of the function name and line number in case of failure.

Signed-off-by: Michael Orlov <[email protected]>
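To illustrate the kind of instrumentation this commit describes, here is a minimal sketch of a timed wait that reports the calling function and line number when it times out. It is not the rosbag2 test code: `wait_for_result` and `WAIT_FOR_RESULT` are hypothetical names, and a plain `std::future` stands in for the pending service response.

```cpp
#include <chrono>
#include <future>
#include <iostream>

// Hypothetical helper (not part of rosbag2): waits for a pending result and,
// if the timeout expires, prints where the wait was issued from.
template<typename T>
bool wait_for_result(
  std::future<T> & result, std::chrono::milliseconds timeout,
  const char * function_name, int line)
{
  if (result.wait_for(timeout) == std::future_status::ready) {
    return true;
  }
  std::cerr << "Service call timed out after " << timeout.count() <<
    " ms in " << function_name << ":" << line << std::endl;
  return false;
}

// Convenience wrapper that captures the call site automatically.
#define WAIT_FOR_RESULT(result, timeout) \
  wait_for_result((result), (timeout), __func__, __LINE__)
```

In a test body this could be used as `ASSERT_TRUE(WAIT_FOR_RESULT(future, std::chrono::seconds(3)));`, assuming a gtest-style assertion, so that a CI failure log names the exact call that timed out.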
@Blast545 I pushed a commit with additional debug info.
@MichaelOrlov Sure thing! From the initial CI that I ran, I see that the second repeated + PR failed with |
Totally agree.
@Blast545 What I can see is that the test runs twice per build, and each time it fails on the same step or iteration. That usually points to a dangling pointer or dangling reference somewhere that gets overwritten at some iteration, depending on the current memory layout. Unfortunately, I don't have more time to debug this issue further, and I am not very familiar with the implementation of service calls and SingleThreadedExecutor. We need to find someone who can continue the analysis from this point. @wjwwood Could you please help with debugging the failure in service calls? Or please suggest someone who can help with it.
@Blast545 @clalancette I would suggest merging this PR with the added debug instrumentation and the increased wait time for service calls.
I would like to see whether it still fails on CI in `toggle_paused` or in the new `is_paused` test.
Yeah, I can help. It might be Monday though before I can catch up on the thread. I'd be happy to chat out of band on Monday as well.
Signed-off-by: Jorge Perez <[email protected]>
I changed the title of this PR to better reflect the debug info added in @MichaelOrlov's commit. I also restored msgs_to_publish from 190 to 200, since the number 195 wasn't as representative as we originally thought.
Redefine `__PRETTY_FUNCTION__` as `__FUNCSIG__` if it is not defined and the compiler is not gcc.

Signed-off-by: Michael Orlov <[email protected]>
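A minimal sketch of the kind of fallback this commit message describes, assuming the goal is to keep `__PRETTY_FUNCTION__` usable on MSVC. The exact guard in the actual commit may differ; the `__func__` branch and the `current_function_name` helper are additions for illustration only.

```cpp
#include <string>

// GCC and Clang provide __PRETTY_FUNCTION__ as a built-in identifier (not a
// macro), so the compiler check is needed in addition to the #ifndef-style test.
#if !defined(__GNUC__) && !defined(__PRETTY_FUNCTION__)
#  if defined(_MSC_VER)
// MSVC spells the decorated function signature as __FUNCSIG__.
#    define __PRETTY_FUNCTION__ __FUNCSIG__
#  else
// Assumed fallback for other compilers: the standard, undecorated name.
#    define __PRETTY_FUNCTION__ __func__
#  endif
#endif

// Hypothetical helper showing the identifier in use.
std::string current_function_name()
{
  return std::string(__PRETTY_FUNCTION__);
}
```

The returned string is compiler specific (GCC and MSVC format signatures differently), so debug output built on it is best treated as a hint for humans reading CI logs rather than something to match exactly.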
@Blast545 Sorry I forgot that |
Not a fix, but this PR should help with the `test_play_services` flaky tests described here: #862
Follow up discussion: #862 (comment)
Only reduced the number of msgs to 190 to confirm that `rmw_fastrtps` fails with specifically 195 iterations.
Only increased the timeout by 2 to avoid increasing the test time too much.
To test the PR I'll run two ci_linux jobs testing rosbag2_transport with the repeated jobs flags.
Signed-off-by: Jorge Perez [email protected]